Banner Years
October 31, 2002 - Walt Davis
Sorry if I missed it, but was there any control for age in this? The "stars" decline in year 4 may well be an age effect.
What this implies is that these players were probably slightly lucky for three straight years, and that their true talent level is actually 42% above average.
Even though you used "implies" this strikes me as an overly strong and frankly illogical statement. I certainly can't think of any reason to look at 3 years of 1.49 and 1 year of 1.42 and come to the conclusion that 1.42 is our best estimate of their true level. If anything, it implies that they were unlucky in year #4 and that their true talent level is 49% above average. Though, as noted above, age and other factors are potential non-luck predictors.
I also think there may be some slight selection bias. In either the 4 or 5 year cycles, you show that those who have the banner year at the end of the cycle retain a little of that banner year. But due to your sample selection, the group whose banner year came at the beginning of the cycle didn't retain any of it in Year 2! In other words, this group consists of folks who never retained anything from their banner year in the next season, so it's not really surprising that they didn't retain anything 3-4 years later.
And I agree with the earlier poster that it probably makes more sense to measure "banner year" relative to the player. To me the question of retention is whether the player retained their improved performance. In the case of the ones who had the banner year at the end, we can compare to their pre-banner performance and see what they retained. But we can't do this with the folks who had the banner at the beginning of your period of observation. For example, what we know about those folks is:
133 102 102 101 102
and we conclude that they didn't retain anything. But perhaps with a fuller set of numbers like:
95 95 133 102 102 101 102
we'd see that they did retain some of their improvement.
I think what you really want to do is look at, say, the 3 years before and after a banner (or fluke) year. That would be better able to address what, if any, retention there is and how long it lasts.
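For what it's worth, here's a rough sketch of that kind of before/after windowing (the data and the 25% cutoff for a "banner" year are made up purely to show the mechanics):

import pandas as pd

# hypothetical player-seasons: perf indexed to league average (100 = average)
df = pd.DataFrame({
    "year": [1996, 1997, 1998, 1999, 2000, 2001, 2002],
    "perf": [95, 96, 95, 133, 102, 102, 101],
}).sort_values("year").reset_index(drop=True)

def banner_window(g, jump=1.25, window=3):
    # flag a banner year as perf >= jump * mean of the prior `window` seasons,
    # then report the mean of the `window` seasons before and after it
    for i in range(window, len(g) - window):
        pre = g.loc[i - window:i - 1, "perf"].mean()
        post = g.loc[i + 1:i + window, "perf"].mean()
        if g.loc[i, "perf"] >= jump * pre:
            return {"pre": pre, "banner": g.loc[i, "perf"], "post": post}
    return None

print(banner_window(df))   # e.g. {'pre': ~95.3, 'banner': 133, 'post': ~101.7}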
Banner Years
October 31, 2002 - Walt Davis
For what you're doing with it, I don't think "the Hank Aaron issue" is an issue. If I understand what you did correctly, the primary issue with Hank Aaron is that he contributes numerous 3/4 year spells of "star" performance -- i.e. he appears multiple times in your data. This does create the problem that the observations are not independent of one another. However, the assumption that observations are independent is an assumption that relates to calculating standard errors and such, and is generally only a problem when you're performing statistical tests. But here you're only calculating means and not testing anything. (note, lack of independence almost always leads to larger standard errors than what you get by assuming independence, meaning the problem is usually one of finding more significance than there really is. Note also that these days most statistical packages have ways of performing appropriate tests.)
There are other potential complications I suppose. Hank Aaron may be an outlier of sorts. For example, if he contributes lots of 149 149 149 149 spells while others are going 149 149 149 132, he could be masking what is really a fairly sharp decline for every player of less caliber than Aaron. [note, banner years are essentially outliers too and the real question is always to what extent do you allow such observations to impact your analysis]
Or essentially the same thing from a different angle is that numerous Aaron spells may be pulling up the "true" level of the whole group.
Another comment on your comments: while it's true that a player that's close to 100 is more likely to be a true 100 player, I don't see why this is important. Doesn't it make the same sense just to look for players who were relatively stable for a 3 year period then saw their production jump by X% over the mean of the previous 3 years? I don't see why it should matter whether the pre-banner (or post-banner, I suppose) performance is league (position?) average; I think the important characteristic is that it's stable and substantially lower than the peak.
Banner Years
October 31, 2002 - Walt Davis
In that linked HR analysis, tango wrote:
So, here is a question to ask some statisticians: Q - Is the fact that 15% of players play at the banner-level, and 15% play at the pre-banner level, in their post-banner years: can this be explained simply by random chance?
Others can follow the link to see the numbers. This is hard to say for sure just based on the data presented in the article, but we can make some guesses. For example, Hack Wilson's pre HR rate was essentially .064/AB. His peak rate was .096/AB. In the article you say that a good approximation of the post HR rate is 60% pre + 40% peak, which in this case gives us an expected post HR rate of .077. His actual post HR rate was about .04.
We can use the binomial distribution to find out how likely that is over, say, 2 seasons of 600 AB each, given these rates. There are some reasons why assuming the binomial distribution is not ideal, but it's likely a good enough approximation.
For fairly large samples, say 100 or greater (especially with small %ages at play), this can be done fairly easily by hand (well, calculator). Assuming .077 is the "true" rate and we have 1200 AB:
mean = .077*1200 = 92.4
variance = .077*(1-.077)*1200 = 85.3
standard deviation = sqrt(variance) = 9.2
A rough 95% confidence interval of the expected # of "successes" is the mean +/- 2*sd. OK, technically it should be more like 1.96. Anyway, in this case, assuming a true HR rate of .077 and 1200 AB, we'd expect Hack to hit 74 to 110.8 HR. He actually hit 48. So we can safely say that Hack hit significantly fewer HR than expected.
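Or, spelling out that arithmetic in a few lines of code (same numbers as above):

from math import sqrt

p, n = 0.077, 1200          # assumed "true" post HR rate and ABs
mean = p * n                # ~92.4 expected HR
sd = sqrt(p * (1 - p) * n)  # ~9.2
lo, hi = mean - 2 * sd, mean + 2 * sd
print(f"95% CI for HR in {n} AB: {lo:.1f} to {hi:.1f}")  # ~74 to ~111; he actually hit 48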
Even by "rough" standards, that still doesn't fully answer your question. If we'd hypothesized beforehand that Hack would underperform, the above would be the appropriate statistical test. But we didn't. And regardless, by using a 95% CI (or a 5% alpha), we've essentially said we're willing to accept a 5% error rate. In other words, even by random chance, there's a 5% chance that a hitter with a true HR rate of .077 will hit fewer than 74 or more than 110.8 HR in 1200 AB.
To guesstimate this, for 116 hitters in your sample, we'd expect about 6 of them to exceed their 95% CI. Note I said _their_ 95% CI, as each hitter would have a different expected post HR rate and slightly different standard deviation. You'd perform the above test for every hitter, find out how many exceeded their 95% CI, then see whether this number was substantially bigger than 6. If so, that's pretty good evidence that something non-random is also happening.
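In code, the per-hitter version looks something like this (the hitters list here is made up -- you'd fill it in with the actual pre/peak rates and post-peak ABs and HRs for all 116 hitters):

from math import sqrt

# hypothetical (pre_rate, peak_rate, post_AB, post_HR) tuples
hitters = [
    (0.064, 0.096, 1200, 48),   # e.g. the Hack Wilson numbers above
    (0.030, 0.055, 1000, 41),
    (0.020, 0.045, 900, 25),
]

outside = 0
for pre, peak, ab, hr in hitters:
    exp_rate = 0.6 * pre + 0.4 * peak           # expected post rate per the article
    mean = exp_rate * ab
    sd = sqrt(exp_rate * (1 - exp_rate) * ab)
    if hr < mean - 2 * sd or hr > mean + 2 * sd:
        outside += 1

print(outside, "of", len(hitters), "hitters fall outside their own 95% CI")
# with all 116 hitters, about 0.05 * 116 ~ 6 would fall outside by chance alone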
Another, I suppose easier, way to approximate this is to regress the actual rates on the expected rates. Look at the mean squared error. If the remainder is purely random, then its square root (the RMSE) should be about the same as the above standard deviation divided by 1200. Probably better if you calculate the standard deviation using the mean expected post HR rate, but the differences will be trivial.
Bruce, Lee, and the Goose
December 19, 2002 - Walt Davis
Just a couple quick comments....
One problem we're having here, I think both in terms of assessing them by traditional standards and by Tango's new-fangled standards, is that we've yet to see a closer (or fireman) hold the job for a truly extended period of time. Smith and Gossage both did it for about 13 seasons. Maybe this means comparing them to starters isn't the best way to go. With a few exceptions, HOF starters held that job for a good 18-20 years. Relievers, at least high-leverage ones, seem rarely able to do that (Wilhelm and ???).
Which brings us to the idea of making a correction for position. 3B, 2B and especially C tend to have shortened careers. Why? I don't think we know that (except for C's), but I think most of us make some "correction" for this when evaluating their HOF standards against those of a 1B or OF. So while we don't know why relievers should have shorter careers (in fact it seems counter-intuitive), maybe we need to compare relievers to relievers.
Of course maybe the reason is that they aren't that good and that's why they ended up in the pen to begin with. So maybe we shouldn't compare them to other relievers and maybe there shouldn't be any in the HOF any more than the best 4th OF in history should be in the HOF.
Now what I think would be neat is to compare relievers to the starters of yesteryear, but only in late-inning, high-leverage situations. If we think of the evolution of the closer/fireman/bullpen, the notion is that these guys are more effective late in the game than a starter who's pitched 7-8 innings. How does Gossage compare to the performance of HOF starters in high leverage 8th-9th inning situations?
I'm assuming we don't have enough PBP data to answer that question, but it's one I'd like to have before making an HOF decision on these guys.
OPS: Begone! Part 2
May 29, 2003 - Walt Davis
Well, let me put words in Tango's mouth.
I think part of the confusion is over "game decisions". Tango appears to mean (mainly) in-game decisions. So, is OPS a better means than many managers currently use for deciding, say, who should be their regular LF? Yes, probably so.
However, within the context of an in-game decision, OPS may be as likely to lead you astray as BA or another metric.
Take the Pujols-Delgado example offered earlier. Which one you'd rather have up there in a "key" situation depends on what kind of "key" situation we're talking about.
If you're down 1 in the 9th with 2 outs, you probably want the guy with the best HR rate.
If you're down 1 in the 9th with 2 outs and a runner on first, you probably want the guy with the best XBH rate.
If you're down 1 in the 9th with 2 outs and a runner on second, you probably want the guy with the best BA.
If you're down 1 in the 9th with 2 outs and the bases loaded, you probably want the guy with the best OBP.
If you're down 2 in the 9th with nobody on base, you want the guy with the best OBP.
I say "probably" because there are other factors involved, like how fast the runner is, how good is the next hitter, who's pitching, etc.
Still you can see that for all five of those scenarios, OPS is pretty much useless. So are EQA or LWTS or probably any other meta-metric.
OPS: Begone! Part 2
June 1, 2003 - Walt Davis
Peter, to put it in the most straightforward (but slightly misleading :-) way, think of it like this:
Player A has 240 TB and 120 BB for 360 bases produced; Player B has 277 TB and 46 BB for 323 bases produced.
In essence, B is trading 74 walks for 37 TB. That's not likely to be a good trade. Generally speaking a walk is worth 2/3 of a single. So if all 37 of those extra TB are singles, those are worth (on average) about 55 walks. That still leaves Player A about 19 walks ahead.
As Tango hints, it gets a bit more complicated than that because not all TB are created equal. Also non-intentional walks are worth more than intentional ones. Using the simple LWTS formula (as good a set of values as any):
a single is worth .47 runs, or .47 r/TB, and roughly 1.5 times a walk
a double is worth .77 runs, or .385 r/TB, and roughly 1.2 times a walk (on a per-base basis)
a triple is worth 1.05 runs, or .35 r/TB, and roughly 1.1 times a walk
a HR is worth 1.41 runs, or .35 r/TB, and roughly 1.1 times a walk
So if we were to compare 12 TB to a corresponding number of BBs, under the different scenarios:
12 singles = 18 walks
6 doubles = 14.4 walks
4 triples = 13.2 walks
3 HRs = 13.2 walks
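Here's the same conversion as a quick sketch in code (the .32 run value for a walk is my assumption; it's roughly what the ratios above imply):

run_value = {"1B": 0.47, "2B": 0.77, "3B": 1.05, "HR": 1.41, "BB": 0.32}
bases =     {"1B": 1,    "2B": 2,    "3B": 3,    "HR": 4}

def walk_equivalent(hit_type, total_bases=12):
    # runs from `total_bases` worth of this hit, expressed as a number of walks
    n_hits = total_bases / bases[hit_type]
    return n_hits * run_value[hit_type] / run_value["BB"]

for h in ("1B", "2B", "3B", "HR"):
    print(h, round(walk_equivalent(h), 1), "walks")
# roughly 17.6, 14.4, 13.1, 13.2 walks per 12 TB -- close to the equivalences above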
It brings up an interesting dilemma. You're at the plate with a 3-2 count. A pitch which is a ball but which you can hammer is coming to the plate. Do you swing? Well, to produce the same number of runs (on average), you'd better be able to get a hit with that pitch about 2/3 of the time or slug close to 800, otherwise you're better off with the walk.
Obviously the above varies with the situation. Which is Tango's other point.
How are Runs Really Created
August 16, 2002 - Walt Davis
1 - Linear regression is LINEAR. Linear as in a straight line. While there is a somewhat linear relationship between runs and singles, doubles, triples, and walks, there is NOT a linear relationship between runs and HR, or runs and everything else like SB, WP, BK, etc. Baseball is non-linear.
Not exactly. Linear regression is linear because it's "linear in the parameters." There are many ways to model non-linearities among the variables using multiple regression. For example:
y = b0 + b1*X + b2*X^2 + e
(where X^2 means X-squared) will model a nice quadratic curve for you. Or something like:
y= b0 + b1*out2 + b2*out3 + b3*X + b4*X*out2 + b5*X*out3 + e
where out2 and out3 are dummy variables representing different out situations (out1 is the "omitted category") would model 3 different lines for each out situation. Or even something like:
y = b0*X^b1*e
can be converted into a linear in the parameters equation by taking the natural log:
ln(y) = ln(b0) + b1*ln(X) + ln(e).
There's no problem mixing linear and non-linear terms in the same equation (assuming you have the rationale to back it up):
y = b0 + b1*ln(X1) + b2*X2 + b3*ln(X1)*X2 + b4*X3 + b5*X3^2 + e
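To make the "linear in the parameters" point concrete, here's a minimal sketch fitting the quadratic model above with ordinary least squares -- the data are fabricated purely to show the mechanics:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 500)                              # fabricated predictor
y = 2 + 1.5 * x - 0.1 * x**2 + rng.normal(0, 1, 500)     # true quadratic relationship

# design matrix [1, X, X^2]: non-linear in X, but linear in the coefficients
X = np.column_stack([np.ones_like(x), x, x**2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print("b0, b1, b2 =", b.round(2))                        # ~ [2, 1.5, -0.1]

# marginal effect of X at its mean: dy/dx = b1 + 2*b2*X
print("dy/dx at mean X:", round(b[1] + 2 * b[2] * x.mean(), 2))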
2 - The independent variables are not independent. There is an interdependence between all these variables. A walk is only worth what it is because of the other things that happen. Linear regression attempts to "freeze" all the other variables when calculating the value of the unfrozen variable. As your run environment increases however, we know that the values of these variables change. Baseball is interdependent.
Your terminology here is unclear. First, "independent variables" need not be independent of one another -- statistically, independence means not correlated, whereas multiple regression is useful precisely because the variables in the model need not be independent of one another (as long as they aren't perfectly correlated). In regression, a coefficient gives you the impact of adding that particular variable to the model, after having removed all the influence of the other variables from both the dependent variable and the independent variable in question (aka "statistical control"). I don't see any inherent problem with doing that here, but perhaps I'm missing something.
Now it is true that the coefficient estimates "control" for the effects of the other variables, and there may be times when such control is impossible. For example take the equation above:
y = b0 + b1*X + b2*X^2 + e
Now obviously we can't hold X^2 constant while changing X. But this doesn't mean that linear regression is an inappropriate model here, it just means it doesn't make sense to use b1 alone to estimate the impact of a change in X. Instead, using derivatives, we can see that the impact of a change in X is given by:
dy/dx = b1 + 2*b2*X
But I'm really not seeing what this has to do with how slopes change by run environment. Modeling that would suggest other possibilities like a series of dummy variables representing different run-scoring eras or a multi-level random effects model.
3 - Even if you assume for ease that run creation is linear and independent (a safe assumption for very controlled environments), what sample data will you use to run your regression against? Most people will use team season totals, which is an aggregate of individual games, which is an aggregate of individual innings. If you want to run a proper regression analysis, at the very least run it on a game or inning level. Your sample size will explode to something much more reliable.
This is a good point, but of course this has nothing to do with the appropriateness or inappropriateness of multiple regression, but rather with what the proper unit of analysis is. Multiple regression is an extremely robust technique and sample sizes of a few hundred are quite sufficient -- in fact, sample size plays no role in the assumptions that make "ordinary least squares (OLS) regression" the "best linear unbiased estimator" as long as you have more cases than variables. The sample size will impact the magnitude of your standard errors -- i.e. if you have a small sample, your errors may be so large that you can't say anything with precision.
The inappropriateness of what you're talking about here is in applying coefficients derived from team-level regressions to individual-level data, and you're completely correct that this is statistically inappropriate. On the other hand, I'm fairly confident that aggregation bias is fairly small in the particular example of baseball.
The upshot being that it's not inappropriate to use a linear regression model at the team level nor would it be inappropriate to use one at the game level. It is inappropriate to use the coefficients from one to derive estimated run values for the other.
The above point also suggests that a multi-level or other model that controls for autocorrelation is the appropriate method to use -- such a model would produce similar coefficients but would produce more reliable standard errors.
On the other hand, if we model at the game or inning level, then it is no longer kosher to assume a continuous dependent variable, which means that OLS is no longer an appropriate model. A poisson or negative binomial distribution is probably a reasonable fit, and it's really not that hard to fit autocorrelational structures to these models either. (poisson and negative binomial aren't easy to explain, but basically they are appropriate where the possible outcomes consist of a small set of integers -- such as runs scored per game or inning. As the number of possibilities increases, the difference between these distributions and a continuous one is usually trivial. And for the record, poisson and negative binomial models are also linear in the parameters.)
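For instance, a sketch of an inning-level poisson model (the data here are fabricated, and I'm using statsmodels just for convenience):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000                                        # fabricated inning-level data
hits  = rng.poisson(1.0, n)
walks = rng.poisson(0.35, n)
hr    = rng.binomial(1, 0.12, n)
mu = np.exp(-1.6 + 0.5 * hits + 0.3 * walks + 1.0 * hr)
runs = rng.poisson(mu)                          # runs per inning: small counts, lots of zeros

X = sm.add_constant(np.column_stack([hits, walks, hr]))
fit = sm.GLM(runs, X, family=sm.families.Poisson()).fit()
print(fit.params)                               # log-linear, but still linear in the parameters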
4 - Not accounting for all the variables. Triples have a strong relationship to speed. If you don't have SB in your sets of variables, the regression analysis will award more weight to the triples as a stand-in (because of its relationship to steals). It is possible, based on some samples, that the value of a triple could exceed the value of the HR! What other variables are you not accounting for?
This of course is a problem for any model, including yours. After all, the value of stealing 2nd with 2 outs is dependent not just on who's on the mound, but who's at bat and on-deck, how often do they hit singles, how good are the outfielders' arms, how good is the catcher at blocking the plate, the score and inning of the game (will the outfielders throw home?), who's warming up in the bullpen, etc. That's a lot of conditional RE tables. :-)
Essentially by definition, any model is only good at giving you an "average" or "typical" sort of estimate. To get more precise requires such a high degree of specificity that it's no longer a model.
Note that, technically speaking, not accounting for all variables in a regression is only a problem if (1) those variables impact the dependent variable AND (2) the omitted variables are correlated with one or more independent variables. As you note, triples and steals will be correlated and therefore omitting one would be a problem. However, if things like HBP and interference are truly random, omitting them from the model will not bias the coefficients for variables included in the model.
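A tiny simulation makes the point (every relationship here is invented purely for illustration):

import numpy as np

rng = np.random.default_rng(2)
n = 5000
speed   = rng.normal(0, 1, n)
triples = 0.8 * speed + rng.normal(0, 0.6, n)    # correlated with speed
steals  = 0.7 * speed + rng.normal(0, 0.7, n)    # also driven by speed
hbp     = rng.normal(0, 1, n)                    # unrelated to the others
runs    = 1.0 * triples + 0.2 * steals + 0.3 * hbp + rng.normal(0, 1, n)

def first_coef(y, *cols):
    # OLS coefficient on the first predictor passed in
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

print("triples coef, all included:   ", round(first_coef(runs, triples, steals, hbp), 2))  # ~1.0
print("triples coef, steals omitted: ", round(first_coef(runs, triples, hbp), 2))          # inflated
print("triples coef, hbp omitted:    ", round(first_coef(runs, triples, steals), 2))       # still ~1.0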
All this aside, chances are none of this will have much impact. Baseball scoring is not all that variable, most of the important variables have been identified, etc. Chances are the best we can hope for is minor improvement in the level of error. The proof's in the pudding there and I hope a future article will compare the accuracy of your method to the existing ones.
How are Runs Really Created
August 16, 2002 - Walt Davis
Mike T said pretty much anything I was going to say to Tango. I will add that I don't mean to be disagreeing with Tango here -- there are lots of ways to model things and linear regression is certainly not always the best solution (as Mike points out). I just wanted to clarify what I thought were some misconceptions about what linear regression can and can't do.
I will simply add that the sorts of autocorrelation models that I discussed are not really all that tough as Tango seems to think. The Stata package has several nice ways of dealing with these models very easily, as does LIMDEP (though LIMDEP is kind of a pain overall). S+, SAS, and maybe even SPSS can also handle these models, as well as more specialized packages such as HLM, MLn, and I'm sure somebody has written Gauss and mathematica modules for them as well. Not that there's any reason for sabermetricians to really dive deep into this stuff ... just that I'm a bit disappointed to find folks who are willing to calculate run expectations tables for something like 60,000 different scenarios would be reluctant to include a few dozen dummy variables and interaction terms in their equations. :-)
How are Runs Really Created
August 16, 2002 - Walt Davis
Oh, one other thing...
2 - John Jarvis did a regression analysis on I believe the 1976-2000 TEAM SEASONAL totals and came away with a regression value of .62 (or something) for a double, and .87 (or so) for a triple. Those values are nonsensical in reality. It doesn't matter that his r-squared was 90% or that the standard error was very low. It's wrong. I've done regression analysis on team totals by era, and the results also were strange in some cases.
Well, actually, what the standard error was is always important. And it's important to know what's meant by "low." I'm not familiar with this research, but let's for the moment assume that the standard error for the doubles coefficient was .15. That would make the doubles coefficient highly statistically significant (t>4). But it would also give us a 95% confidence interval of roughly .32 to .92. While .92 seems obviously too high and .32 too low, we're pretty confident that the real value of a double is in there somewhere.
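Spelling that out (still assuming that hypothetical .15 standard error):

coef, se = 0.62, 0.15                       # hypothetical SE, as assumed above
t = coef / se                               # ~4.1, highly significant
ci = (coef - 1.96 * se, coef + 1.96 * se)   # ~(.33, .91), i.e. roughly .32 to .92
print(round(t, 1), tuple(round(x, 2) for x in ci))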
Now, I've gotten too technical already, but it's also likely that Jarvis did not control for the autocorrelation of the data. I haven't really explained this yet, so let me try. In OLS regression, we assume that the observations are independent of one another -- i.e. not correlated. In the case of team seasonal regressions, it's not a safe assumption that the 1976 Yankees are independent of the 1977 Yankees. They won't be identical, but to suggest they're unrelated is a stretch.
The thing is, if your data have autocorrelation and you don't correct for it, then your standard error estimates are biased (it may also be the result of omitted variables which could introduce bias). Generally speaking, your standard error estimates will be smaller than they should be and therefore your confidence intervals narrower than they should be. So while the doubles coefficient may not have a reported .15 standard error, after correcting for autocorrelation, we might find out that .15 is its 'true' standard error.
And back to an earlier point. The .62 coefficient for doubles should only be applied to team-seasonal data, not individual data or game data or what have you.
Finally and somewhat self-contradictory, I suspect Tango is right in this particular example. Those coefficients do suggest that there's something wrong with the model. Whether that's omitted variables or the wrong functional form (i.e. we need to add some non-linearities) I can't say.
SABR 301 - Talent Distributions (June 5, 2003)
Posted 7:32 p.m.,
June 5, 2003
(#14) -
Walt Davis
Mostly a non-sequitur, but let me just jump in and say that normal distributions are anything but typical. About the only place in the real world where you find normal distributions is in physical/biological matters (e.g. height and quite possibly baseball talent) and in large enough samples of something close enough to a binomial distribution. To most statisticians, the real-world normal distribution is a bit like the holy grail.
I bring this up only because this is at least the third time I've seen a sabermetrician mention how common or typical a normal distribution is.
Making Money (June 23, 2003)
Posted 5:48 p.m.,
June 26, 2003
(#5) -
Walt Davis
Upon re-reading that article, it occurs to me that per-capita income is not quite the correct variable. Per-capita income is much higher in the NY metro area than it is in KC ... but so is the cost-of-living. We should probably add a COL variable to the equation or somehow adjust PCI for the COL.
Reliever Usage Pattern, 1999-2002 (June 24, 2003)
Posted 5:04 p.m.,
June 25, 2003
(#15) -
Walt Davis
I think one thing that would help in terms of presentation is to give some examples or "ideal types" of the different leverage ranges. What's a "typical" 0-.5 leverage situation? It will help the non-statistical get a better idea of what this really means.
Obviously at this point it would be great to look at changes in these percentages over time (from Sutter to today, e.g.).
I'm curious as to why the .5-1 range has such low percentages. Any idea why these situations aren't very common?
Reliever Usage Pattern, 1999-2002 (June 24, 2003)
Posted 6:24 p.m.,
June 25, 2003
(#18) -
Walt Davis
I think Jason makes a good point -- it would be interesting to see how leverage differs by team based on record. I'm not sure there will be big differences. Leverage comes in close games and it seems that all teams play about the same number of close games. The difference between good and bad teams is often that good teams win lots of blowouts and bad teams lose lots of blowouts.
So I guess another way to look at it is: what, if any, is the relationship between the distribution of leverage and the distribution of game run differential? If 50% of ML PA have little leverage, that kinda suggests that 50% of ML games are blowouts.
Redefining Replacement Level (June 26, 2003)
Posted 5:31 p.m.,
June 26, 2003
(#7) -
Walt Davis
Well, this is better than the other two Silver articles that he brags about at the beginning of the piece (and he misstates his own conclusion from the Coors piece which was that high-K, low-BB hitters benefit). (I bring that up only because I almost stopped reading the piece at that point)
Nevertheless, the problems with defining replacement level are just another reason to measure relative to average. Sure, there are advantages to replacement level, but they quickly disappear if either you don't know where replacement level really is or calculating proper replacement level becomes cumbersome.
By the way, average C EQA for 2003 is listed as 254, and Flaherty's is listed as 244 (not 220 as Silver wrote), which puts him just below average. That's better than Miguel Olivo, Geronimo Gil, and even Bobby Estalella and Charles Johnson. In fact, by my count, he's got a higher EQA than 11 starters and 16 backup Cs. Last year the average C EQA was 246 and his EQA was 239, again just below average, so it's not a complete fluke. Sad as it may be, he would seem to be well above replacement level (at least on offense).
Which gets us back, in its way, to Nate's (and others) point about problems with replacement level. According to BP's site, 72 C's have played in the majors this year and 30 (40%) of them are below replacement level. Unless there are at least 18 above-replacement level C's in the minors, we would seem to clearly have replacement level set too high. Otherwise it is in fact nearly impossible to get a replacement-level replacement C.
Results of the Forecast Experiment, Part 2 (October 27, 2003)
Posted 11:31 a.m.,
October 30, 2003
(#70) -
Walt Davis(e-mail)
Holy Moly! I don't even remember doing this and I'm second. I'm a freakin' genius.
Tango, would you either e-mail me or post my picks? I'm flabbergasted -- I usually suck at this sort of thing.
Results of the Forecast Experiment, Part 2 (October 27, 2003)
Posted 12:07 p.m.,
October 30, 2003
(#71) -
Walt Davis
As far as the 5%, 25%, etc. levels, such as Pecota does, personally, I don't think anything other than using regular old z scores are appropriate (IOW, if you have a .700 OPS projection, then there is a 5% chance that that player would have an OPS of greater than 2 SD above or below .700, where one SD is based on one year's worth of projected PA's. Anything other than that (such as what Pecota tries to do), is BS I think (I am not sure)...
Well, whether PECOTA does it correctly is another question. But I would be very surprised to find that performance had a normal distribution. Such beasties are in fact quite rare in reality.
First, with the exception of Barry Bonds, there seems to be an upper limit to how well a player can perform. Secondly, and probably more importantly, there's a lower limit below which a player will not be allowed to perform.
I don't know what the standard error is on an OPS prediction, but +/- 60 points wouldn't surprise me. Well, if the baseline prediction is a 700 OPS, it would seem to be silly to say that the 95% confidence interval is 580 to 820. Unless it's Cesar Izturis or the Tigers, the player's not going to be allowed to post a sub-620 OPS for very long.
Or take Jason Giambi. Even with an age adjustment, I'd guesstimate that his 2003 baseline prediction would have been at least a 1000 OPS. But should anyone think that, given his age and weight, an 1120 OPS was as likely as an 880 OPS? That doesn't make sense to me.
Finally, especially with older players and maybe the youngest ones too, it would seem that their chances of significantly underperforming their prediction are greater than the chance of overperforming it. In short, I'd imagine that age increases the chances of falling off a cliff. And while that is covered somewhat by the age adjustment, I doubt it covers it sufficiently. And, paradoxically, if it did cover it sufficiently, the chances of overperforming would probably be greater than the chances of underperforming ... which means we aren't talking a normal distribution.
Now, having said all that, I suspect that the substantive impact of a symmetric vs. a non-symmetric error distribution is probably trivial.
Results of the Forecast Experiment, Part 2 (October 27, 2003)
Posted 4:00 p.m.,
November 3, 2003
(#80) -
Walt Davis
One of the problems is that there are two factors which determine what the "curve" will look like - one, the distribution of a binomial (will the player get a hit, a walk, a home run, etc., in each PA or won't he?),
There are a few potential problems with this. First, if we are talking PA, we are possibly talking a Bernoulli distribution. That is a trivial distinction, but the binomial distribution is the result of a series of independent Bernoulli trials with a constant probability.
Now independence seems close enough to being true -- there is little if any evidence that the outcome of a player's last PA impacts the outcome of their current PA (e.g. streaks, etc.). However we know that the p-level is not constant from PA to PA for a batter -- positive outcomes are far less likely against Pedro Martinez or in Dodger stadium. Hence we know that the binomial is not the exact distribution of these outcomes. It's probably not far off, but it's definitely not the perfect assumption to make.
Far more importantly, a PA is not a Bernoulli trial. A Bernoulli trial is an event with only two outcomes. The chances of getting a hit vs. not getting a hit may be a Bernoulli trial. But there are multiple possible outcomes (single, double, triple, etc.), each of which makes the other outcomes impossible. This is either a conditional probability problem (i.e. first was it a hit, next what kind of hit was it) or it's a multiple outcome problem. Regardless, each PA is NOT a Bernoulli trial because Bernoulli trials only have two outcomes. No Bernoulli, no binomial. No binomial, no straightforward leap to a normal distribution.
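To illustrate, here's a quick sketch of a PA as a multiple-outcome draw (the per-PA probabilities are made up):

import numpy as np

rng = np.random.default_rng(3)
outcomes = ["out", "BB", "1B", "2B", "3B", "HR"]
p = [0.655, 0.09, 0.17, 0.05, 0.005, 0.03]     # made-up per-PA probabilities, sums to 1

# a PA has more than two outcomes, so a season of PAs is a multinomial, not a binomial
season = rng.multinomial(600, p)
print(dict(zip(outcomes, season)))

# the binomial only appears once you collapse to a two-outcome question, e.g. hit / no hit
p_hit = sum(p[2:])
print("hits in 600 PA (collapsed two-outcome view):", rng.binomial(600, p_hit))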
When modeling the probability of multiple outcomes like this, one usually either assumes a "probit" or a "logistic" model. In both (or any other generalized linear model), you assume that there is an underlying continuous (unmeasurable) variable which determines the likelihood of the varying outcomes. In probit, this underlying variable is assumed to be normal; in logistic, it is assumed to follow a slightly different distribution.
For any individual outcome, a poisson or negative binomial distribution may be as good or better a fit than the binomial. These distributions are often considered better for rarer events, though I'll admit I know that only as a rule of thumb, I've never seen any research on it (nor have I ever looked for any).
Thirdly, what is being projected? Are sabermetricians actually trying to simultaneously predict the number of singles, doubles, triples, HRs, BBs, Ks, HBPs, and outs? Or are they trying to project an OPS or VORP or win shares or some other single measure of value/performance? If the former, I've yet to notice anyone talking about running multinomial or conditional logit and probit models. Moreover, we'd essentially be talking about multi-dimensional space, so "normal" would be an odd way to describe the error distribution. If the latter, I remain unconvinced that there's any good reason to think it follows a normal distribution.
As a brief and very tiny example, let's look at the players in this fun little project.
The baseline estimate overestimated 13 of 21 hitters. The average misestimation was 30 points of OPS over actual performance. Or to put it another way, the average projected OPS for these 21 hitters was 846, the actual was 816. Only 1 hitter outperformed his projection by 100 points or more of OPS; 5 hitters underperformed their projection by 100 points or more.
Things aren't much different for the forecasters, except they overestimated only 11 of 21. But the average misestimation was still 30 points of OPS over actual performance, and their mean projection was pretty much identical.
Things are essentially the same for pitchers, where ERA was underestimated by .36 by the baseline and .53 by the forecasters.
So, for this small sample, assuming that the confidence interval would be a normal distribution with a mean equal to the projected performance would clearly have been a bad idea. Whether the "problem" is a mean bias in the forecasts or asymmetric confidence intervals is an open question. Of course, these players were chosen specifically for their "uncertainty". But there's no question that, this year, a symmetric confidence interval for these uncertain players would have been a bad choice.
ALCS Game 7 - MGL on Pedro and Little (November 5, 2003)
Posted 5:22 p.m.,
November 6, 2003
(#11) -
Walt Davis
You've all heard how Pedro doesn't have the stamina at the over 105 pitch count limit right?
Please click on the above link.
I'm not sure what we were supposed to click through to or how much we were supposed to read. But in the linked thread there's a link to Pedro's career numbers over 105 pitches and they're outstanding.
But I don't think anyone is saying that Pedro has never been able to pitch past 105 pitches. The man used to be one of the more durable starters in the game.
And to all you analysts who based your opinions on 100 PA: you should know better.
Well, sometimes that's all you have. And while the mean may not be a particularly reliable estimate, it's still the best estimate of central tendency we have (unless you want to go for the median) regardless of sample size.
In your post on that other thread, you gave Pedro's 2001-2003 #'s after 105 pitches:
21 ip, 25 hits, 12 walks, and 26 K, with 0 HR and 9 ER (10 R).
Here are his numbers pre-2001:
148 ip, 119 hits, 37 walks, 176 K, 9 HR, 41 ER.
The hit rate and walk rate are much, much higher the last three years. I don't have batters faced or I'd do a quick test, but I suspect they are statistically significantly different. The ERA is certainly much higher (3.86 vs 2.49).
I dunno, but Pedro (Pedro!) walking 12 guys in 21 innings looks like a clear sign of trouble to me. That's a 5.1/9 walk rate for a guy with a career 2.4/9 walk rate. 37 baserunners in 21 innings is a WHIP of 1.76 ... enough to make Jose Lima blush and enough to make us think that 3.86 ERA is a bit lucky (or the result of good bullpen support). Small sample or no, that screams ineffectiveness for a pitcher of Pedro's quality. Could it be random? You betcha. Is it likely to be purely random? No.
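For the curious, here's roughly the quick test I have in mind, with batters faced approximated as 3*IP + H + BB (a rough stand-in, since I don't have the real BF):

from math import sqrt

# rough batters-faced estimates (3*IP + H + BB)
bf_recent = 3 * 21 + 25 + 12     # 2001-2003 after 105 pitches: ~100 BF
bf_prior  = 3 * 148 + 119 + 37   # pre-2001 after 105 pitches: ~600 BF
w_recent, w_prior = 12, 37

p1, p2 = w_recent / bf_recent, w_prior / bf_prior
p_pool = (w_recent + w_prior) / (bf_recent + bf_prior)
z = (p1 - p2) / sqrt(p_pool * (1 - p_pool) * (1 / bf_recent + 1 / bf_prior))
print(round(p1, 3), round(p2, 3), "z =", round(z, 2))
# with these rough BF estimates, the walk rates come out roughly two standard errors apart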
Was it "obvious" Pedro was tired? That's a legit question. But in defense of all those folks out there who were "FULL OF CRAP", we'll note that most of them were screaming for Pedro to be taken out well before the outcome was known. It's completely unfair to accuse these folks of judging it based on the outcome.
What would they have said if he'd gotten through that inning? Well, I've been in that situation plenty of times and I can assure you that, unless the pitcher does something like blow the next 3 batters away, I say "boy, they got lucky there."
Clutch Hitting: Fact or Fiction? (February 2, 2004)
Posted 10:54 a.m.,
February 3, 2004
(#11) -
Walt Davis
The difference between a "one standard deviation good" and an average clutch hitter amounts to only 1.1 successful appearances, while the difference between a good and an average overall hitter amounts to 3.9 successful plate appearances. In short, any argument that clutch skills should be ignored could equally well be an argument that all batting skill should be ignored in clutch situations, given that randomness is the largest factor of all.
I would assume this is what Charles is referring to. A clutch hitter will have 1.1 more successful PAs in about 150 clutch situations in a season. Those 1.1 additional successful PAs will result in maybe one additional run created (on average) per year between your clutch and non-clutch hitters. Now those are fairly high leverage runs and maybe worth something like .5 wins each.
Moreover it seems quite plausible that differences in pitchers faced or randomly distributed differences in base/out scenarios might explain such a small difference (i.e. if the average batter does better with a man on 1st and no one on 2nd due to the 1B holding the runner on, then batters who randomly had more of these clutch situations would do better). At the very least, we'd think that at least some of the variation in clutch performance is due to variation in these factors, meaning the "true" clutch effect is likely even smaller than this.
I'd imagine that in clutch situations, the base/out/deficit situation would be treated differently by different hitters. Here "clutch" includes cases where the tying run is at first base, or in the batters box, or on-deck. I'd think a Gwynn-type hitter would approach these situations the same way, but if the batter represents the tying run or if the tying run is on 1B, a Thome-type hitter is looking for a HR or at least a double. Rather than Gwynn-types being clutch hitters, perhaps Thome-types are making a conscious (and perhaps correct) decision to sacrifice some OBP for some SLG. In other words, if Gwynn's 1.1 extra successes are singles, Thome only needs one extra HR than he would normally hit to make up that difference.
Another issue to address is statistical power. Statistical power is the probability that a test statistic detects an effect of size X in a sample size N (at a given alpha level). While it would seem great to have lots of power, there's the downside that with enough power, even trivial differences will achieve statistical significance. In other words, with a big enough sample size, everything is significant.
If I weren't lazy, I'd look up the proper formula for power in the binomial. Instead I did a quickie simulation. I simulated 10,000 careers of 612 players with a "regular" OBP of 328 where each had 1000 career clutch PAs (is this reasonable?) and 1/3 had "clutch" OBPs of 326, 1/3 328, and 1/3 330. For each set of 612 careers, I determined the number of players who had significantly more or fewer hits (at the .05 level) assuming a .328 OBP (i.e. no clutch). In any given set of 612 careers, we'd expect 5% of batters (or 30.6) to exceed that level due to randomness.
I then looked at the distribution of this count across the 10,000 sets of careers, which ranged from 13 to 54. Given 612 trials and a p of .05, a count higher than 40 should occur in about 3.8% of the sets of 612 careers. However, in our simulation, we get a count of 40 or higher about 6.7% of the time.* In other words, we would easily reject the null hypothesis that there's no difference in clutch OBP. We would conclude this even though the differences are in fact quite trivial.
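Something along these lines (a stripped-down version of that simulation, using a normal-approximation test for each career rather than an exact binomial test):

import numpy as np

rng = np.random.default_rng(4)
n_sets, n_players, n_pa = 10_000, 612, 1000
true_obp = np.tile([0.326, 0.328, 0.330], n_players // 3)   # thirds of the players

# career clutch "successes" for every player in every replication
hits = rng.binomial(n_pa, true_obp, size=(n_sets, n_players))

# two-sided test of each career against the null of a .328 clutch OBP
p0 = 0.328
se = np.sqrt(p0 * (1 - p0) * n_pa)
z = (hits - p0 * n_pa) / se
sig_count = (np.abs(z) > 1.96).sum(axis=1)       # significant players per set of 612

print("range of counts:", sig_count.min(), "-", sig_count.max())
print("share of sets with more than 40 significant players:",
      round((sig_count > 40).mean(), 3))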
Which is just a way of saying that the p-values reported in this article aren't necessarily impressive, they're just reflective of the power of the test. The magnitude of the effect is just one of the things that impacts the power of the test. The magnitude of the effect in the study is greater than the ones I used above, but still quite small.
* there appears to be some bias in my random number generator such that even with no effect simulated, I get about 4.4% (instead of 3.8%) with a count above 40 ... or I shouldn't have used the binomial to generate that 3.8% expected value. But 6.7% is still 50% above 4.4% and we'd conclude a significant difference. The other possibility is that the probability function is off, meaning the 3.8% number is too low.